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As studies of DNA methylation increase in scope, it has become evident that methylation has a complex relationship 
with gene expression, plays an important role in defining cell types, and is disrupted in many diseases. We describe 
large-scale single-base resolution DNA methylation profiling on a diverse collection of 82 human cell lines and tissues 
using reduced representation bisulfite sequencing (RRBS). Analysis integrating RNA-seq and ChlP-seq data illumi- 
nates the functional role of this dynamic mark. Loci that are hypermethylated across cancer types are enriched for 
sites bound by NANOG in embryonic stem cells, which supports and expands the model of a stem /progenitor cell 
signature in cancer. CpGs that are hypomethylated across cancer types are concentrated in megabase-scale domains 
that occur near the telomeres and centromeres of chromosomes, are depleted of genes, and are enriched for cancer- 
specific EZH2 binding and H3K27me3 [repressive chromatin). In noncancer samples, there are cell-type specific 
methylation signatures preserved in primary cell lines and tissues as well as methylation differences induced by cell 
culture. The relationship between methylation and expression is context-dependent, and we find that CpG-rich en- 
hancers bound by EP300 in the bodies of expressed genes are unmethylated despite the dense gene-body methylation 
surrounding them. Non-CpG cytosine methylation occurs in human somatic tissue, is particularly prevalent in brain 
tissue, and is reproducible across many individuals. This study provides an atlas of DNA methylation across diverse 
and well-characterized samples and enables new discoveries about DNA methylation and its role in gene regulation 
and disease. 



[Supplemental material is available for this article.] 

In the early 1980s, several groups observed that the covalent ad- 
dition of a methyl group to the cytosine base in mammalian ge- 
nomic DNA at certain loci is associated with differential gene 
expression of nearby genes (Razin and Riggs 1980; Sutter and 
Doerfler 1980; van der Ploeg and Flavell 1980). This led to decades 
of research deciphering the patterns and purpose of this fifth DNA 
base in the human genome. It is an ongoing challenge to map the 
locations of methylated cytosines across the genome and to un- 
derstand their roles in cell-type specific gene regulation (Ghosh 
et al. 2010), the establishment of gene expression patterns for 
stable differentiation (Lei et al. 1996; Okano et al. 1999; Jackson 
et al. 2004; Blelloch et al. 2006; Ji et al. 2010; Kim et al. 2010), and 
disease, including cancer where vast changes in DNA methylation 
patterns occur (Laird and Jaenisch 1994; Ushijima 2005; Sharma 
et al. 2010; Tsai and Baylin 2011). Bisulfite sequencing provides 
the most direct and highest resolution method to quantify DNA 
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methylation in the genome, enabling the ability to calculate the 
fraction of molecules that are methylated at each individual 
cytosine sequenced (Frommer et al. 1992). The advent of next- 
generation DNA sequencing technologies has prompted the de- 
velopment of methods that take advantage of this vastly increased 
throughput, using bisulfite sequencing to query large subsets of 
the human genome (Meissner et al. 2008) and even whole human 
genomes (Cokus et al. 2008; Lister et al. 2009; Laurent et al. 2010; 
Li et al. 2010; Hansen et al. 2011; Berman et al. 2012; Hon et al. 
2012). 

The goal of this study was to generate a high-quality com- 
pendium of DNA methylation data across a large number of hu- 
man cell lines and tissues. Reduced representation bisulfite se- 
quencing (RRBS) was chosen because it provides quantitative, 
single-base resolution methylation profiles for a large subset of the 
human genome that is enriched for genie regions and CpG islands 
(CGIs) (Meissner et al. 2008). Other genomic assays have been 
performed in these samples as part of The ENCODE Project (The 
ENCODE Project Consortium 2007, 2011, 2012), providing a rich 
resource for integrated analysis of DNA methylation, gene ex- 
pression, transcription factor binding, and chromatin modifica- 
tions. We demonstrate that comparisons of methylation profiles 
across the diverse collection of samples in this study can be used to 
investigate cancer-associated methylation defects, cell-type specific 
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methylation, and epigenetic changes induced by cell culture. We 
quantify the reciprocal relationships of promoter and gene body 
methylation to expression levels and present the identification of 
a DNA methylation signature of intragenic (gene body) tran- 
scriptional enhancers marked by EP300 that clarifies the in- 
terpretation of gene body methylation. We report the discovery 
of reproducible non-CpG cytosine methylation in human so- 
matic tissue, particularly in adult human brain tissues. The data 
described here provide an atlas of DNA methylation for in- 
vestigating how this epigenetic mark relates to other molecular 
and phenotypic characteristics within these diverse cell types, 
including many commonly used cell line models. The data are 
readily available for visualization and analysis using the UCSC 
Genome Browser (Fujita et al. 2011; Raney et al. 2011) and will 
provide a valuable resource for future comparisons to other cell 
types, disease states, and functional genomic assays. 

Results 

Quantifying DNA methylation 

We modified a previously published protocol for RRBS (Meissner 
et al. 2008) to create a streamlined workflow for this larger-scale 
implementation (Supplemental Fig. SI). For each reference cyto- 
sine sequenced, we computed the percentage of reads in which the 
cytosine was methylated (remained a C after bisulfite treatment) 
out of the total reads covering that position (Supplemental Fig. S2). 
This percent methylated (PM) value represents the percentage of 
molecules that were methylated at each cytosine. 

In replicate growths of the human myeloid cell line K562, we 
found that lower read depth provided less reproducible measure- 
ments of PM between replicates. When restricted to CpGs with at 
least 10 X coverage, the reproducibility of PM measurements across 
the full range of values improved between replicates (r = 0.987) and 
resulted in an average of 3.96 PM difference per CpG between 
replicates (Supplemental Fig. S3 A). The resulting PM measure- 
ments were highly correlated with values obtained from an array- 
based methylation assay (Illumina Methyl450K, r = 0.954) (Sup- 
plemental Fig. S3B). We performed RRBS on 82 human cell lines 
and tissues in duplicate (sample information in Supplemental 
Table SI) and obtained at least 10 x coverage for an average of 
700,000 CpGs in each sample. The Mspl restriction digest utilized 
for RRBS enriches for CpG-dense regions of the genome, in- 
cluding genes (twofold enrichment) and CpG islands (111-fold 
enrichment). 

Global observations 

We found that all samples, regardless of the disease state or tissue 
type, had similar distributions of methylation among the assayed 
CpGs (mean of pairwise R 2 = 0.96) (Supplemental Fig. S3C). In each 
sample, 5%-15% of assayed CpGs were completely methylated 
(PM > 90), and 65%-80% of assayed CpGs were unmethylated 
(PM < 10). This consistency could appear because the same CpGs 
are always methylated in all samples, or it could result from the 
same net amount of methylation placed on different loci between 
samples. Our data set supports the latter; the PM values of in- 
dividual CpGs varied substantially across cell lines and tissues. 
Only 4% (27,053) of CpGs are unmethylated (<10 PM) across all 
cell lines and tissues that we assayed, and these are located near the 
transcription start sites (x 2 , P< 0.0004) of genes with housekeeping 
functions (P < 1.35 X 10" 10 ). The remaining 670,000 CpGs that we 



assayed exhibited differential methylation in this study, providing 
a rich data set for investigating epigenetic patterns. 

To characterize cell-type specific methylation patterns, we 
analyzed 440,974 autosomal CpGs with at least 10x coverage in at 
least 90% of the samples. We performed unsupervised hierarchical 
clustering on the PM values for the top 5% of CpGs with the most 
variable methylation (N= 22,696, a > 32.6). The samples all paired 
with their replicates and clustered into clades with distinct meth- 
ylation patterns, and these clades corresponded to distinct types of 
samples, namely cancer cell lines, primary cell lines, tissues, and 
blood leukocytes (Fig. 1A; detailed tree in Supplemental Fig. S4). 
We divided these most variable CpGs based on whether they are 
located in gene regulatory regions (<2000 bp from the transcrip- 
tion start site) or in the body of genes (>2000 bp from the tran- 
scription start site) and found that both subsets recapitulate the 
classification of samples in the four clades (Supplemental Fig. S5). 
To determine if the detection of cell-type specific differences would 
be confounded by epigenetic variability introduced when cell lines 
were grown in different ENCODE Consortium laboratories, we 
obtained growth replicates of the same cell lines from different 
laboratories. We found that they clustered together based on cell 
type, not laboratory. Furthermore, for a particular cell type, repli- 
cates from within a lab and replicates grown in distant labs were 
equally similar (Supplemental Fig. S6). 

Aberrant methylation across cancer cell lines 

The dominant signal in this data set is cancer-specific hyper- 
methylation found at 66,570 CpGs in the cancer cell lines (Fig. 1A), 
including lines derived from breast, prostate, lung, ovarian, en- 
dometrial, liver, and pancreatic cancer, as well as neuroblastoma 
and several leukemias (Kolmogorov-Smirnov, P < 1 X 10~ 7 ). Of the 
loci that are hypermethylated across cancers, 48,787 (73%) reside 
in CGIs (Table 1) and represent a significant portion of the 148,465 
island CpGs assayed (33% vs. expected 15%, Fisher's exact test, 
P < 0.05). The observation of hypermethylation at CGIs across the 
genome, regardless of their proximity to genes, in 18 diverse cancer 
cell lines is consistent with reports of a CpG island methylator 
phenotype (CIMP) that was first described in colorectal cancer 
(Toyota et al. 1999) and has since been documented in many 
cancer types (Teodoridis et al. 2008; Liu et al. 2011; Chen et al. 
2012; Turcan et al. 2012). An additional 7377 (11%) nonisland 
CpGs are significantly hypermethylated in cancer, and these reside 
in promoters and bodies of genes encoding proteins with se- 
quence-specific DNA binding transcription factor activity (hyper- 
geometric enrichment for GO:0003700, P= 5.4 X 10" 15 ) (Table 1). 
The presence of increased methylation in both the promoter and 
gene body of these transcription factor genes indicates dysregula- 
tion of methylation at these genes in cancer cell lines. 

It has been observed that hypermethylation in cancer is 
enriched at loci that in embryonic stem cells (ESCs) are unmeth- 
ylated, have bivalent chromatin marks, and are reversibly re- 
pressed by the Polycomb Repressive Complex. This observation 
has led to the model of a stem/progenitor cell signature in cancer 
(Ohm et al. 2007; Schlesinger et al. 2007; Widschwendter et al. 
2007; Easwaran et al. 2012). We determined the overlap between 
the loci that are hypermethylated across the cancer cell lines in our 
data set and the location of binding sites for 149 transcription 
factors that were assayed by chromatin immunoprecipitation se- 
quencing (ChlP-seq) experiments performed by our lab and others 
in The ENCODE Project Consortium. For each transcription factor 
that overlapped loci that are hypermethylated in cancer cell lines, 
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Figure 1. Methylation patterns distinguish cell types and reveal aberrant hypermethylation across cancers. (A) Unsupervised hierarchical clustering of 
the top 5% of CpGs with the most varying methylation across 82 samples distinguishes four major clades, identified as cancer cell lines, tissues, primary cell 
lines, and blood leukocytes. (B) Loci that are hypermethylated across cancers are significantly enriched for sites that are bound by NANOG in embryonic 
stem cells. UCSC Genome Browser visualization of the SFRP2 gene showing DNA methylation data, NANOG binding sites in the embryonic stem cell line 
HI -hESC (HI -hESC), and RNA-seq data. The color in the RRBS track indicates the percent of molecules that are methylated at each CpG position. (Red) 
1 00%, (yellow) 50%, (green) 0%. Hypermethylation across the cancers occurs in the SFRP2 gene promoter where NANOG, a transcription factor, binds in 
HI -hESC. NANOG binding in HI -hESC is visualized as green peaks in both ChlP-seq replicates, and peak boundaries are depicted as black and gray boxes 
below the raw signal (darker boxes indicate a more significant peak). The RNA-seq data demonstrate that SFRP2 is expressed in HI -hESC and not expressed 
in the cancer cell lines (HeLa, HepG2, MCF7, and HCT1 1 6). 



we report the enrichment and statistics in Supplemental Table S2. 
Consistent with previous reports, we found that hypermethylated 
CpGs have a significant overlap with loci that are bound by SUZ12 



in ESCs, a component of Polycomb Repressive Complex 2 (Fisher's 
exact test, Benjamini-Hochberg (BH)-adjusted, P= 9.27 X 10" 142 ). 
The corepressor CTBP2 was also enriched at these sites and sig- 
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Table 1. Genomic context of cell-type specific methylation 



Near TSS (<2 Kbp) Far from TSS in gene body Intergenic 





CGI 


Hypo 


Hyper 


Hypo 


Hyper 


Hypo 


Hyper 


Cancer-specific 


+ 


78 (03%) 


26,434 (99.7%) 


1 30 (1 .5%) 


8701 (98.5%) 


291 (2.1%) 


1 3,652 (97.9%) 






673 (15.5%) 


3657 (84.5%) 


1 628 (30.4%) 


3720 (69.6%) 


3890 (51.1%) 


3716 (48.9%) 


Blood-specific 


+ 


59 (4.2%) 


1 359 (95.8%) 


1 03 (1 1 .9%) 


763 (88.1%) 


50 (8.9%) 


511 (91.1%) 






265 (28.3%) 


673 (71.7%) 


456 (29.9%) 


1069 (70.1%) 


360 (29.9%) 


843 (70.1%) 


Tissues vs. 


+ 


71 (44.9%) 


87 (55.1%) 


257 (100%) 


0 (0%) 


43 (43.4%) 


56 (56.6%) 


primary cell lines 




152 (33.7%) 


299 (66.3%) 


718(100%) 


0 (0%) 


239 (43.1%) 


316(56.9%) 



nificantly overlapped SUZ12-bound loci in ESCs (Fisher's exact 
test, BH-adjusted, P = 9.75 X 10~ 20 ). Additionally, we discovered 
that hypermethylated loci were also enriched for sites where 
NANOG is bound in ESCs, which is a transcription regulator es- 
sential for maintaining pluripotency in ESCs (Fisher's exact test, 
BH-adjusted, P = 4.06 X 10" 4 ). The hypermethylated NANOG 
binding sites do not overlap the SUZ12-bound loci, and the genes 
nearest the hypermethylated NANOG binding sites are enriched 
for genes that are overexpressed in human ESCs (P = 6.64 X 10~ 5 ) 
(Ben-Porath et al. 2008). These observations support the role of 
embryonic transcriptional regulators in directing aberrant meth- 
ylation that leads to cancer but add complexity to the model. In 
contrast to the polycomb repressive complex recruiting hyper- 
methylation in cancer to persistently silence differentiation- 
inducing genes, our results suggest that hypermethylation also 
occurs at NANOG-bound loci that are active in ESCs. While these 
are seemingly opposing effects, it is possible that both mechanisms 
lead to tumorigenesis by silencing genes that lead to differentia- 
tion, as well as silencing genes that control the stable, nonneo- 
plastic division of a pluripotent cell. This is supported by the ob- 
servation that NANOG binding site hypermethylation in cancer 
occurs at genes enriched for transcription factor activity (P = 9.78 X 
10~ 5 ), including developmental regulators such as FOXD3, a nega- 
tive regulator of cell cycle (Abel and Aplin 2010), and CDX2, whose 
silencing is associated with cancer progression (Huang et al. 2012; 
Knosel et al. 2012). Figure IB depicts another example, the SFRP2 
gene, a modulator of Wnt signaling, whose promoter is unmeth- 
ylated and bound by NANOG in the HI ESCs where the gene is 
expressed. In cancer, the NANOG binding site in the promoter is 
methylated, and the gene is silenced. 

Hypomethylation across the genome has been reported in 
cancer (Irizarry et al. 2009; Hansen et al. 2011; Berman et al. 2012; 
Hon et al. 2012). Although RRBS enriches for CpG-rich regions of 
the genome that tend to be hypermethylated in cancer, we also 
query thousands of positions in low CpG-density regions. We 
detected 6691 positions that were significantly hypomethylated 
across the cancer types when compared to the primary cell lines 
and tissues (Kolmogorov-Smirnov, P < 1 X 10~ 7 ). We discovered 
that these hypomethylated loci were colocalized in the genome 
and identified 114 independent megabase windows that had sig- 
nificantly more hypomethylated loci than expected, taking into 
account the nonrandom genomic coverage of the assay (Fisher's 
exact test, P = 2 X 10~ 5 ) (genomic coordinates listed in Supple- 
mental Table S3). These megabase-size hypomethylated domains 
were significantly depleted of genes (binomial, P < 3.48 X 10~ 40 ) 
and were significantly enriched in the 10-Mbp ends of chromo- 
somes and near the centromere, although they are not usually 
directly adjacent to the telomeres or centromeres (binomial, p = 
4.41 X 10~ 19 ) (Fig. 2A). Recent reports have found that some 



tumors exhibit hypomethylation corresponding to lamina- 
associated domains (Berman et al. 2012), but we did not observe an 
enrichment for lamin Bl binding in the hypomethylated domains 
we identified. It has also been reported that hypomethylated 
regions in cancer correspond to H3K27me3 (Hon et al. 2012; 
Statham et al. 2012), and we found that the domains we identi- 
fied overlap long tracks of H3K27me3, as well as correspond- 
ing stretches of EZH2 binding. We compared the presence of 
H3K27me3 in these regions between cancer cell lines and non- 
cancer cell lines and discovered that the long tracks of H3K27me3 
were specific to the cancer cell lines (x 2 test, P = 2.4 X 10~ 27 ). 
Brinkman et al. demonstrated that when DNA methyltransf erases 
are knocked-out in ESCs, broad local enrichments (BLOCs) of 
H3K27me3 appear in place of high levels of methylation (Brinkman 
et al. 2012). It is plausible that the same process is directing 
H3K27me3 BLOC formation at these hypomethylated domains 
in cancer. Among the ChlP-seq experiments comprising 149 
transcription factors, we did not identify any factors that were 
particularly enriched in these domains. Notably, these hypo- 
methylated domains occasionally contain a gene that is ex- 
pressed in specific cancer samples, and those samples exhibit 
gene-body methylation within the hypomethylated domain 
corresponding to the gene's expression. This indicates that gene 
expression and gene-body methylation are not occluded from 
these regions by the unmethylated tracks of H3K27me3 sur- 
rounding the gene. The prevalence of these domains across 
cancer types warrants further investigation of these regions as 
predictive biomarkers and to uncover the mechanisms driving 
these massive defects. Figure 2B depicts an example of a domain 
at the end of the q-arm of chromosome 22, where a 2-Mb gene- 
depleted region is specifically hypomethylated across cancer 
cell lines and exhibits long tracks of cancer-specific H3K27me3 
and EZH2 binding. This hypomethylated domain is flanked by 
methylation corresponding to the gene-body methylation of 
expressed genes. 

The single nucleotide resolution of bisulfite sequencing al- 
lows us to detect both methylation and DNA sequence variants 
in the same molecules, and we used this information to identify 
loci with allele-specific or allele-biased methylation (Gertz et al. 
2011). We identified 1144 CpGs that were adjacent to an allelic 
variant and exhibited allele-biased methylation in the noncancer 
samples from different tissues and individuals. We found that 
1027 (90%) of these CpGs, which are allelically methylated in 
noncancer samples, exhibit aberrant methylation in cancer cell 
lines (Kolmogorov-Smirnov, FDR-adjusted, P < 0.05) (Supple- 
mental Fig. S7A). This aberrant methylation occurs as either 
hypermethylation (gain of methylation on the unmethylated 
allele) (example in Supplemental Fig. S7B) or hypomethylation 
(loss of methylation on the methylated allele) (example in 
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Figure 2. Megabase-size domains are hypomethylated across cancers. (A) We identified 1 1 4 megabase windows in the genome that are significantly 
hypomethylated across cancer cell lines, compared to primary cell lines and tissues. These domains are enriched near the ends and centromeres of 
chromosomes. (B) UCSC Genome Browser visualization of a 2-Mb hypomethylated domain on the q-arm of chromosome 22. The color in the RRBS track 
indicates the percent of molecules that are methylated at each CpG position. (Red) 100%, (yellow) 50%, (green) 0%. Hypomethylation across cancers 
occurs in the 2-Mb gene-depleted region. RNA-seq demonstrates that the methylated regions flanking the cancer-specific hypomethylated domain 
contain genes that are expressed in both the cancer (HeLa, HepG2, and K562) and noncancer samples (GM1 2878 and HI -HESC). The chromatin ChlP-seq 
tracks demonstrate that the hypomethylated region is marked by cancer-specific repressive H3K27me3 and EZH2 binding (cancer = K562, HeLa, HepG2; 
noncancer = HMEC, GM12878, NH-A). 



Supplemental Fig. S7C). Loss-of-imprinting has been previously 
reported at particular imprinted loci in cancer (Feinberg et al. 2002; 
Bjornsson et al. 2007; Feinberg 2007; Monk 2010), and these 



observations demonstrate that the majority of allelically methyl- 
ated loci are dysregulated in cancer, regardless of whether they are 
imprinted. 
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Cell-type specific methylation 

When we isolated and subjected noncancer samples to un- 
supervised hierarchical clustering of the top 5% of CpGs with the 
most variable methylation (N = 22,152, a > 24.5), the samples 
again clustered into clades corresponding to distinct types of bi- 
ological samples: tissues, primary cell lines, embryonic cell lines, 
and blood leukocytes (Fig. 3A; detailed tree in Supplemental Fig. 
S8). The blood leukocyte-derived samples, including both periph- 
eral blood leukocytes and Epstein-Barr virus (EBV) -immortalized 
lymphoblastoid cell lines, clustered together and displayed a dis- 
tinct pattern of methylation at 6511 CpGs (Kolmogorov-Smirnov, 



5% most variable CpGs among non-cancer samples 




Percent Methylated 



O CO <D O |n. 

<- co (D in to <- t 




P < 1 X 10~ 7 ). Blood-specific hypermethylation occurs at CGIs 
(Table 1), but unlike the ubiquitous CGI methylation in cancer cell 
lines, only a select subset of CGIs (66%) exhibit hypermethylation 
when compared to tissue and primary cell lines. Epigenetic regula- 
tion of genes involved in blood leukocyte development was evi- 
dent and included specific methylation at genes involved 
in regulation of body fluid levels (GO:0050878, P = 3.25 X 1(T 4 ), 
blood coagulation (GO:0007596, P = 5.33 X 10" 4 ), hematopoietic or 
lymphoid organ development (GO:0048534, P = 8.52 X 10" 4 ), and B 
cell activation (GO:0042113, P = 9.24 X 1(T 4 ). 

We identified seven tissue types that were represented by 
both primary cell lines and primary tissues and performed ANOVA 
to identify CpGs whose methylation is 
significantly associated with the tissue of 
origin (regardless of whether it has been 
grown in culture). We identified 117,795 
CpGs whose methylation was signifi- 
cantly associated with the tissue of ori- 
gin and was consistent in both the pri- 
mary cell lines and the primary tissues 
(FDR < 0.05) (subset visualized in Fig. 3B). 
These data support previous observations 
that there is a large number of tissue- 
specific differentially methylated regions 
(tDMRs) (Rakyan et al. 2008) and that 
primary cell lines can serve as models for 
understanding epigenetic tissue-specific 
gene regulation at a large number of loci. 
However, as described above, we found 
that unsupervised hierarchical clustering 
of the most variable CpGs across sam- 
ples divided the primary cell lines from 
the tissues (Figs. 1A, 3 A). We identified 
2238 CpGs that significantly discriminate 
primary cell lines from tissue samples 
(Kolmogorov-Smirnov, P < 1 X 10~ 7 ) 
(Table 1; loci listed in Supplemental Ta- 
ble S4). These methylation differences 
associated with cell culture occur pre- 
dominately as unmethylated CpGs in 
the bodies of genes involved in regulat- 
ing cellular proliferation (GO:0042127, 
P = 5.17 X 10" 4 ). Studies of DNA methyl- 
ation that use cell lines as model systems 
could use this list to reduce false positives 
due to the epigenetic effects of cell culture. 
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Figure 3. Noncancer samples exhibit methylation differences associated with cell culture, as well as 
tissue-specific methylation that is preserved between primary cell lines and tissues. (A) Unsupervised 
hierarchical clustering of the top 5% of CpGs with the most varying methylation across noncancer 
samples separates clades of samples characterized as tissues, primary cell lines, embryonic cell types, and 
blood leukocytes. The tissues and primary cell lines are divided into separate clades by a cell culture- 
associated methylation signature. (B) Seven tissue types were represented by both primary cell lines and 
tissues in this data set (tissue types listed in legend). ANOVA identified 117,795 CpGs significantly 
associated with tissue of origin (FDR < 0.05). For this visualization, we performed unsupervised hier- 
archical clustering on the 3223 significant CpGs with the largest standard deviation of PM values across 
the samples (SD > 26). Both primary cell lines and primary tissues share a common tissue-specific 
methylation pattern, and the heat map displays the methylation patterns associated with each tissue of 
origin. Many CpGs are partially methylated in the tissues (black = 50%) at loci where the cell lines are 
completely methylated (yellow = 1 00%), indicating that heterogeneity among the cell types comprising 
the tissues results in a dampened signal compared to the pure cell population isolated in a cultured cell 
line (tissues marked by *, cell lines unmarked). 



Context-dependent DNA methylation 
signatures of gene expression 

The current models describing the rela- 
tionship between DNA methylation and 
gene expression indicate that promoter 
methylation is associated with gene si- 
lencing, and gene body methylation is 
associated with expression (Doerfler et al. 
1989; Jones and Baylin 2002; Lorincz 
et al. 2004; Ball et al. 2009; Illingworth 
et al. 2010; Maunakea et al. 2010; Aran et al. 
2011; Deaton et al. 2011). RNA-seq has 
been performed as part of The ENCODE 
Project on several of the samples included 
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in our DNA methylation study, providing an opportunity to ex- 
plore and quantify this relationship. We divided the CpGs near 
genes into four categories that were distinguished based on 
whether or not they reside in CGIs and whether they are near the 
TSS (<2 kb upstream or downstream) or far from the transcription 
start site in the gene body (>2 kb). We computed the Pearson 
correlation coefficient between RPKM values (reads per kilobase 
of transcript per million reads) (Mortazavi et al. 2008) measured 
by RNA-seq and PM values measured by RRBS. In this data set, the 
vast majority of CpGs close to the TSS, regardless of whether they 
reside in CGIs, are negatively correlated with gene expression 
(median r = -0.3756), i.e., increased methylation is associated with 
lower levels of gene expression (Fig. 4A,B). In contrast, the nonis- 
land CpGs far from the TSS in the body of the gene are positively 
correlated with expression (median r = 0.44450), i.e., increased 
methylation is associated with higher levels of gene expression 
(Fig. 4D). For these promoter and nonisland gene body CpGs, the 
current model of the relationship between methylation and ex- 
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Figure 4. Correlation between CpG methylation and gene expression depends on genomic context. 
(A,B) CpGs <2000 bp away from the transcription start site (TSS) are negatively correlated with ex- 
pression, regardless of whether they reside in a CpG island. (C) CpGs that are in gene bodies far from the 
TSS (>2000 bp away) and reside in CpG islands can be either positively or negatively correlated with 
gene expression. (D) CpGs that are in the gene body far from the TSS (>2000 bp away) and do not reside 
in CpG islands and are positively correlated with gene expression, (f ) Model of relationship between 
methylation and gene expression. Expressed genes are associated with unmethylated promoters, 
methylated gene bodies, and unmethylated intragenic CpG island EP300-bound enhancers. (F) Si- 
lenced genes are associated with methylated promoters, unmethylated gene bodies, and methylated 
intragenic CpG island enhancer elements. 



pression holds across these cell lines. However, we also identified 
an exception to the expected pattern: island CpGs residing in gene 
bodies displayed a bimodal distribution of correlation with ex- 
pression levels (Fig. 4C). This indicates that a substantial subset of 
CGIs in the gene body are negatively correlated with expression, 
similar to promoters, rather than positively correlated with expres- 
sion like other gene body CpGs. 

To investigate gene regulatory processes that could account 
for this finding, we integrated data sets from The ENCODE Proj- 
ect, including CAGE tag sequencing, histone modification, and 
transcription factor ChlP-seq. Recent studies have proposed that 
cryptic or alternative promoters, marked by H3K4me3 and CAGE 
tags, may appear as promoter-like methylation in gene bodies 
(Illingworth et al. 2010; Maunakea et al. 2010; Deaton et al. 2011). 
We found that H3K4me3 and CAGE tags account for 8.5% (479/ 
6155) of the gene body CGI CpGs that were negatively correlated 
with expression, providing evidence that this subset of CGIs in- 
deed reside in alternative promoters (Fisher's exact test, P-value = 
1.06 X 10" 7 [H3K4me3] and 0.009124 
[CAGE tags]). 

We sought to identify other function 
elements that could explain the remain- 
ing gene body CGIs that have epigenetic 
regulation similar to promoters. A recent 
study showed that DNA binding factors 
influence DNA methylation and that the 
methylation signature could be used to 
identify CpG-poor distal enhancers in the 
mouse genome (Stadler et al. 2011). The 
causes and consequences of gene-body 
methylation are not well understood, 
making it unclear whether active en- 
hancers could be unmethylated when 
embedded in the densely methylated 
gene body of an expressed gene. We in- 
vestigated whether the unmethylated 
CpG-rich regions that we observe within 
methylated gene bodies might be in- 
tragenic enhancers. We performed ChlP- 
seq for the transcriptional coactivator 
EP300, a factor known to bind to tran- 
scriptional enhancers (Visel et al. 2009). 
The vast majority of CpGs in EP300 
binding sites were unmethylated (PM < 
10; 99.3% in HepG2, 98.2% in GM12878, 
99.6% in hESC HI). When gene body 
CGIs contain EP300 binding sites, they 
are more strongly inversely correlated 
with gene expression than are CGIs not 
bound by EP300 (Kolmogorov-Smirnov 
test, P-value < 2.2 X 10" 16 ) (Supplemental 
Fig. S9). These EP300-bound CGI in- 
tragenic enhancers account for an addi- 
tional 8% (452/5676) of gene body CGIs 
that are negatively correlated with ex- 
pression. The signature of CpG-rich ac- 
tive enhancers in gene bodies was not 
observed at CpG-poor regions, indicating 
that this escape from gene-body methyl- 
ation is associated with CGIs. To identify 
other transcription factors associated 
with these unmethylated gene body CpG 
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islands that may reveal the role of the remaining loci that are not 
associated with EP300 binding, H3K4me3, or CAGE tags, we over- 
lapped them with the binding sites for the remaining 148 tran- 
scription factors in ChlP-seq data sets generated by The ENCODE 
Project Consortium. We did not identify any significant enrich- 
ment for specific transcription factors. The catalog of EP300-bound 
enhancers and other nonpromoter regulatory elements is not 
comprehensive. It is possible that these are enhancers bound by 
EP300 below the sensitivity of ChlP-seq, or they could be regu- 
latory regions bound by transcription factors or noncoding RNAs 
not yet studied by The ENCODE Project 
Consortium. 

Together, these results support a re- 
vised model of the expected DNA meth- 
ylation state found at expressed and si- 
lenced genes, depicted in Figure 4, E and F. 
In expressed genes, active intragenic 
enhancers bound by EP300 appear as 
patches of unmethylated CGIs amid the 
dense methylation found in the body of 
the expressed genes, and these enhancers 
have methylation that is concordant with 
the unmethylated CpGs several thousand 
base pairs away in the promoter and near 
the TSS. The reverse patterns are associ- 
ated with silenced genes. This revised 
model changes our expectations of the 
type of methylation we find in gene re- 
gions and helps to more accurately in- 
terpret gene body methylation. 



Non-CpG cytosine methylation 

DNA methylation predominately occurs 
at CpG dinucleotides in the human ge- 
nome, but there have been recent re- 
ports that non-CpG cytosine methyla- 
tion occurs at a lower, but appreciable, 
level in embryonic and pluripotent cells 
(Lister et al. 2009; Ziller et al. 2011). We 
identified 56,287 cytosines that were 
not located at CpG positions in the ref- 
erence genome that exhibited methyla- 
tion (PM > 10) in at least one sample. We 
eliminated 30,773 (55%) of these meth- 
ylated cytosines from further analysis 
because they were adjacent to genetic 
variants in these samples that created a 
CpG dinucleotide that became methyl- 
ated, a finding that suggests that there is 
a large number of polymorphic CpGs that 
show epigenetic diversity between in- 
dividuals. To reduce false-positives due to 
stochastic errors in bisulfite conversion, 
we identified 2466 non-CpG cytosines 
that were methylated (PM > 10) in both 
replicates of any sample and determined 
how many of these loci were methylated 
in each sample (Supplemental Table S5; 
data in Supplemental Table S6). We found 
the largest number of methylated non- 
CpG cytosines (N = 1418) in the human 



embryonic stem cell line HI (Hl-hESC), consistent with previous 
reports in this cell type (Fig. 5A). Adult human brain tissue had the 
second highest number of methylated non-CpG cytosines, with 
666 non-CpG cytosines methylated in both replicates, and was 
more than twice as high as the following sample (Supplemental 
Table S5). The non-CpG methylation we observed in the brain 
samples occurred at a different set of loci than the non-CpG 
methylation observed in ESCs (Fig. 5 A). This was unexpected in 
light of the recent report of the near complete absence of non-CpG 
cytosine methylation in human somatic cell types, although adult 
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Figure 5. Non-CpG cytosine methylation. (A) We examined 82 cell lines and tissues and identified 
2466 non-CpG cytosines that were methylated in both replicates. The samples with more than 200 
methylated non-CpG cytosines are depicted. The human embryonic cell line (Hl-hESC) contained 1 41 8 
methylated non-CpG cytosines, followed by adult human brain tissue (N = 666), placental tissue (N = 
249), and skeletal muscle from two individuals (female N = 261, male N = 235). (5) The non-CpG 
cytosine methylation identified in the brain tissue was confirmed across post-mortem brain samples 
from 24 different individuals and occurs at a set of loci distinct from those methylated in the other 
samples. (C) The non-CpG cytosine methylation found in the embryonic stem cell line occurred pri- 
marily at the CAG sequence context, consistent with previous reports (Lister et al. 2009). (D) The non- 
CpG cytosine methylation discovered in the adult human brain tissue occurred primarily in the CACC 
sequence context. 
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human brain tissue was not studied (Ziller et al. 2011). The ob- 
servation of non-CpG methylation in somatic tissues challenges 
current hypotheses that a pluripotent-specific regulation or noise 
is involved in establishing this mark. The other samples with more 
than 200 non-CpG cytosines methylated in both replicates in- 
cluded skeletal muscle tissue from two different individuals and 
placental tissue (Fig. 5 A; Supplemental Table S5). To further in- 
vestigate non-CpG cytosine methylation in the brain, we per- 
formed RRBS on brain samples from 24 additional individuals. 
These fresh-frozen samples were collected post-mortem from the 
dorsolateral prefrontal cortex (DLPFC) of healthy control donors as 
part of the Pritzker Neuropsychiatric Disorders Research Consor- 
tium. The non-CpG cytosines that were methylated in the first 
brain sample were also methylated (PM > 10) in these additional 
brain samples (Fig. 5B; data in Supplemental Table S7). The non- 
CpG loci that are methylated in adult brain tissue are distinct from 
the set of non-CpG cytosines that were methylated in ESCs (Fig. 
5B). The non-CpG cytosine methylation that we observed in the 
brain predominately occurs at CAC trinucleotides (Fig. 5D), which 
is different from the CAG trinucleotide context that we and others 
have observed in ESCs (Fig. 5C; Lister et al. 2009; Ziller et al. 2011). 
Non-CpG methylation was recently reported in mouse frontal 
cortex (Xie et al. 2012) in a similar sequence context that we find in 
human brain, suggesting that mice may serve as a valuable 
experimental model for understanding this new methylation 
pattern. We used GREAT (Genomic Regions Enrichment of An- 
notations Tool) (McLean et al. 2010) to determine if these meth- 
ylated non-CpG cytosines are associated with any functional en- 
richment and found that loci in brain tissue are found near genes 
enriched for "blood vessel development" (GO:0001568, GREAT 
hyper FDR q-val = 5.34 X 10~ 6 ), while the loci methylated in hESC 
are found near genes related to small GTPase regulator activity 
(GO:0005083 hyper FDR q-val =4.91 X 10 -4 ), suggesting that these 
events are associated with different gene regulatory processes in 
each cell type. 

Discussion 

We have described an epigenomics resource generated by the 
ENCODE Consortium: large-scale single-base resolution DNA 
methylation profiling on a diverse collection of 82 human cell 
lines and tissues using reduced representation bisulfite sequenc- 
ing. Many of these samples have been characterized with other 
genomic assays by The ENCODE Project Consortium, providing 
a rich resource for exploring functional changes associated with 
DNA methylation. We demonstrated that cell lines grown in rep- 
licate in multiple laboratories display stable DNA methylation 
signatures, that comparing methylation profiles between samples 
identifies methylation profiles relevant to the functional differ- 
ences between cell types, and that these data provide a catalog of 
aberrant methylation found in cancer cell lines. 

We discovered that cancer-specific hypermethylation is en- 
riched at sites where NANOG binds in ESCs. This observation 
complements the previous report of a stem/progenitor signature 
in cancer but expands it beyond loci that are bound by Polycomb 
Repressive Complex 2 to include loci that are bound and activated 
by NANOG, a seemingly contradictory process. Further investi- 
gation is needed to understand when, during the progression of 
cancer, NANOG is present and how would it attract the methylation 
machinery to result in this abenant hypermethylation. Addition- 
ally, we discovered that hypomethylation that is consistent across 
cancer types occurs in megabase-scale domains near the ends of 



chromosomes that contain long tracks of cancer-specific repres- 
sive H3K27me3. Further investigation is needed to understand 
how and why H3K27me3 repression is utilized in these regions 
rather than DNA methylation and whether their location on the 
ends of chromosomes is indicative of a structural mechanism 
or scaffold interaction that leads to their hypomethylation. As 
more genome-scale assays are performed on these samples, we are 
hopeful that further integrated analysis will shed light on the 
causes and consequences of these prolific methylation defects in 
cancer. 

We demonstrate how integrated analysis enabled the quan- 
tification of known relationships between DNA methylation and 
gene expression and describe the characterization of DNA meth- 
ylation at intragenic enhancers. We present a revised model of the 
types of methylation found in the body of expressed and silenced 
genes that includes our finding that unmethylated EP300-bound 
CGIs can be embedded in the densely methylated bodies of 
expressed genes. This model can be used to more accurately predict 
the effects of aberrant DNA methylation found in disease associ- 
ation studies where EP300 binding and gene expression data are 
not available. 

The single-base resolution of RRBS enabled the detection of 
non-CpGs cytosine methylation across the diverse samples in this 
study, which led to the discovery that adult human brain tissue 
from many different individuals contains methylated non-CpG 
cytosines. We find that these loci are different from those pre- 
viously identified in ESCs and occur in a CACC sequence con- 
text, rather than at CAG trinucleotides. This data set provides a 
launching point for the investigation of the mechanisms and 
function of this newly characterized phenomenon. It is intrigu- 
ing that this mark is particularly abundant in brain tissue but 
not in the brain-derived cell lines in our study. It is plausible that 
the non-CpG methylation occurs in a particular type of brain cell 
that was not among the cell lines in this study or that the non- 
CpG methylation is eliminated during cell culture growth. As 
more genomic assay protocols are adapted to work with small 
amounts of tissue, we are hopeful that cell types and factors as- 
sociated with the non-CpG methylation at these loci will be 
revealed. 

Overall, we hope that this atlas of methylation across diverse 
samples, including many commonly used cell line models, proves 
to be a valuable resource for exploring how DNA methylation re- 
lates to other molecular and phenotypic characteristics. 

Methods 

Cell lines and tissues 

Samples included in this study are listed in Supplemental Table SI. 
Detailed information about the samples can be obtained from the 
ENCODE Common Cell Types websites at (http://genome.ucsc. 
edu/ENCODE/cellTypes.html). 

Reduced representation bisulfite sequencing experimental 
procedure 

We modified the previously published protocol for RRBS (Meissner 
et al. 2008) to create a streamlined workflow for this larger-scale 
implementation. We designed the reactions to eliminate the 
phenol extraction and ethanol precipitation steps as well as one of 
the gel extraction steps. We also changed the PCR conditions to 
amplify fragments with diverse GC content and a broad range of 
sizes uniformly. A protocol overview is depicted in Supplemental 
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Figure SI, and specifics are provided as follows. We used the Qiagen 
DNeasy Blood and Tissue Kit to extract DNA. We then digested 
1 |xg genomic DNA with 1 jjlL 20U/|jlL Mspl restriction enzyme 
(New England Biolabs [NEB]) in lx NEBuffer 2 in a total reaction 
volume of 50 |xL. This reaction was incubated at 37°C for 30 min, 
followed by heat inactivation at 80° C for 20 min. We then filled in 
the overhangs and added a 3' A tail by adding dNTP mix to 33 \xM 
and 1 |xL 5U/ |xL Klenow Fragment (3 '-5' exo-) in a total reaction 
volume of 55 |xL. This reaction was incubated at 37°C for 30 min, 
followed by heat inactivation at 75°C for 20 min. We then purified 
the DNA with a Qiagen MinElute column. We purchased two 
methylated DNA oligonucleotides from IDT (www.idtdna.com) as 
follows: ilAdap Methyl PE1 (ACACTCTTTCCCTACACGACGCT 
CTTCCGATC*T) and ilAdap Methyl PE2 (5 ' P-GATCGGAAGAG 
CGGTTC AGC AGGAATGCCG A*G) , where all C's are 5-methyl 
cytosine DNA nucleotides, 5'P indicates a 5' phosphate, and an 
asterisk indicates a phosphorothioate bond. We then annealed 
these oligos to form a stock of 40 |xM duplex DNA adapters in 
a reaction containing 40 |xM ilAdap Methyl PE1, 40 |jlM Methyl 
PE2, 1 X T4 DNA Ligase Buffer (NEB) in a total volume of 50 jiL. We 
incubated this reaction at 95 °C for 5 min, then 70°C for 1 min, 
then 60°C for 1 min, then 50°C for 1 min, then 40°C for 1 min, 
then 30°C for 1 min. We stored these annealed adapters at -30°C 
for future use. We ligated the annealed methylated Illumina 
adapters in a reaction containing 10 |xL purified DNA, lx T4 DNA 
Ligase Buffer (NEB), 1 ^L 400U/^LT4 DNA Ligase (NEB), and 1 ^L 
40 |jlM annealed methylated adapters in a total volume of 20 |xL. 
This reaction was incubated at 20°C for 15 min, followed by heat 
inactivation at 65 °C for 10 min. We electrophoresed the 20 \xL li- 
gation reaction in a 2.5% Seaplaque Agarose (Lonza) gel. The de- 
sired restriction fragments are between 40 and 120 bp, and the 
adapters add 33 bp onto each end of the restriction fragments, so 
we used a razor blade to isolate the agarose gel section containing 
DNA between 106 and 186 bp, while avoiding the adapter self- 
ligation products that appear <100 bp. We then purified the DNA 
using a Qiagen Qiaquick Gel Extraction kit as described in the 
manufacturer's instructions, except that we did not heat the gel 
fragment to dissolve it, and we eluted the purified DNA from the 
column using 22 jjlL buffer EB. We then used 20 \xL of this purified 
DNA in the sodium bisulfite conversion, which was performed 
using the EZ DNA Methylation Gold Kit (Zymo Research). We 
purchased PCR primers that would amplify the adapter-ligated 
DNA and add the cluster generation sequences to the amplicons 
for Illumina sequencing. We purchased these PCR primers from 
IDT as follows: ilPCR PE1 (AATGATACGGCGACCACCGAGATC 
TACACTCTTTCCCTACACGACGCTCTTCCGATC*T) and ilPCR 
PE2 (CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCC 
TGCTGAACCGCTCTTCCGATC*T) , where an asterisk indicates 
a phosphorothioate bond. The DNA that was purified from the 
bisulfite conversion kit was then PCR-amplified in a reaction 
containing 5 U Platinum Taq DNA Polymerase (Invitrogen), 10 X 
PCR Buffer without MgCl 2 (Invitrogen), 2 mM MgCl 2 , 0.5 |xM 
ilPCR PE1 DNA oligonucleotide primer, 0.5 pM ilPCR PE2 DNA 
oligonucleotide primer, 0.5 mM each dNTP and 0.5 M Betaine 
(Sigma-Aldrich) in a total reaction volume of 50 juuL. The reaction 
was incubated at 98°C for 1 min, followed by 20 cycles of (95°C for 
30 sec and 62°C for 3 min). We confirmed the amplification and 
correct product size range by running one-fifth of the reaction on 
a 2% agarose gel. We then purified the remaining PCR product 
with a Qiagen Qiaquick column, eluting in 25 |xL buffer EB, and 
quantified the purified product using the Quant-IT High Sensi- 
tivity dye kit (Invitrogen) on a Qubit fluorometer (Invitrogen). We 
typically obtained 250-750 ng of library material. We then diluted 
each library to 10 nM and proceeded to sequence each library in 
a single lane on the Illumina Genome Analyzer IIx sequencing 



machine according the manufacturers instructions. We typically 
achieved better quality scores and alignment with slightly lower 
cluster density compared to other libraries with more even base- 
representation, and we empirically determined that clustering the 
sample at 5 pM was optimal. 

Sequence alignment and calculating percent methylated value 
for each cytosine 

The sequence data for this project were acquired between June 
2009 and December 2010. On our Illumina Genome Analyzer IIx 
sequencing machine, we sequenced one library per lane and 
obtained between 8 million and 31 million single-end 36-bp reads 
per library. We aligned these reads to a modified reference genome 
sequence that was created to reflect both the reduced representa- 
tion of the genome due to the Mspl restriction digest as well as the 
sodium bisulfite conversion which creates a T in the sequencing 
reads rather than a C at all unmethylated bases. To create this 
reference, we first parsed the hgl9 reference genome to identify all 
of the Mspl restriction fragments <500 bp. We then isolated the 
36-bp ends of these fragments into a fasta file and converted every 
C in the reference to a T and recorded the position of these refer- 
ence cytosines in the name of the reference sequence. To achieve 
optimal alignment that is not biased by the methylation state of 
a molecule, we also created a copy of our sequencing read files, 
converted every C in the read to a T, and recorded the position of 
these read cytosines in the name of the read (Supplemental Fig. S2). 
We then used bowtie (Langmead et al. 2009) to align these con- 
verted reads to the custom reference sequence and required that 
the alignment be optimal and unique in the reference and only 
align to the proper strand (bowtie options -best, -m 1,-norc). On 
average, we uniquely aligned 53.2% (200,000/375,603) of the ge- 
nomic Mspl digest restriction fragments in the selected size range 
(40-120 bp), which resulted in coverage of an average of 1.2 mil- 
lion CpGs in each sample. This is only 8.6% of the 14 million 
nonrepetitive CpGs in the human genome but represents a 1.9- 
fold enrichment for genie regions and a 111-fold enrichment for 
CGIs. We then parsed the alignment file and the encoded read and 
reference names to determine how many reads covered each ref- 
erence cytosine position and what percentage of those reads con- 
tained a C at each reference cytosine position (Supplemental Fig. 
S2). This percent methylated value approximates the percentage of 
molecules in the sample that were methylated at each individual 
cytosine. We compute the bisulfite conversion rate of each sample 
by determining the percent of non-CpG cytosines that are meth- 
ylated (PM > 10), which is an underestimate of the conversion rate 
in samples with biological non-CpG methylation. Each sample 
must meet quality control criteria before data release, including 
a bisulfite conversion rate >98.5%, a complex library with more 
than 500,000 CpGs with at least 10 X coverage, and a correlation 
coefficient of greater than 0.9 between replicates. 

Methyl 450 array methods 

Illumina Methylation450 arrays were run using standard Illumina 
protocols. Briefly, 500 ng of DNA from each cell line was bisulfite- 
converted with the Zymo Research EZ DNA Methylation kit, ampli- 
fied, hybridized, and stained with standard Illumina reagents. The 
intensity data were imported into Illumina's GenomeStudio software, 
and standard beta scores were exported and used in the analysis. 

Analysis of methylation across samples 

Once we compiled the percent methylated values for all cytosines 
shared across samples, we then performed extensive analysis of the 
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trends in these data. Statistical associations including mean cal- 
culations, standard deviation calculations, Pearson correlations, 
binomial tests, Fisher's exact tests, x 2 tests, hypergeometric tests, 
and Kolmogorov-Smirnov tests were calculated using Matlab 
(The Math works, Inc.), and the statistical package R (The R Foun- 
dation for Statistical Computing; http://www.R-project.org). Clus- 
tergrams were created using the average linkage of Euclidean dis- 
tance in Cluster3.0 (de Hoon et al. 2004) and visualized using 
Java Tree View 1.1. 4r3 (Saldanha 2004). For the annotation of cy- 
tosine positions relative to gene features, we used the genomic 
coordinates for gene features from the hgl9 refGene table of the 
UCSC Genome Browser (Fujita et al. 2011; Raney et al. 2011). 
Similarly, we used the genomic positions of the CpG islands track 
on the UCSC Genome Browser to annotate CGI occupancy. To 
measure the Pearson correlation between methylation and ex- 
pression, we excluded CpGs whose methylation did not vary by 10 
PM across the cell lines to avoid spurious correlations to noise in 
the methylation measurements. For identifying transcription fac- 
tor binding sites associated with loci hypermethylated across 
cancer cell lines, we used the compiled supertrack data set con- 
taining binding sites for 149 transcription factors from ENCODE 
ChlP-seq experiments, which is available from the UCSC Genome 
Browser (http://www.genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid= 
305048059&c=chr2&g=wgEncodeRegTfbsClusteredV2). For ana- 
lyzing the binding sites of the specific transcription factors, we 
used individual ChlP-seq data sets. The data for SUZ12 binding 
sites in Hl-hESC are available from the UCSC Genome Browser 
(http://www.genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=305048059& 
g=wgEncodeSydhTfb). The data for NANOG binding sites in Hl- 
hESC are also available from the UCSC Genome Browser (http:// 
www.genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=305048059& 
g=wgEncodeHaibTfbs). For characterizing the hypomethylated 
domains found across cancer cell lines, we used the H3K27me3 
and EZH2 binding ChiP-seq data collected by the Broad Institute 
as part of The ENCODE Project, which were obtained from the 
UCSC Genome Browser (http://www.genome.ucsc.edu/cgi-bin/ 
hgTrackUi?hgsid=286312585&c=chr4&g=wgEncodeBroadHistone). 
The nuclear lamina-associated domains data were obtained from 
the UCSC Genome Browser (http://www.genome.ucsc.edu/cgi-bin/ 
hgTrackUi?hgsid=305048059&c=chr2&g=laminBlSuper). The 
RNA-seq data for the HeLa, hESC HI, K562, HepG2, and GM12878 
cell lines were collected as part of The ENCODE Project and can 
be found under "RNA-seq from ENCODE/Caltech" on the UCSC 
Genome Browser (http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid= 
193248635&c=chrl0&g=wgEncodeCaltechRnaSeq). The H3K4me3 
ChlP-seq data for the HeLa, K562, HepG2, and GM12878 cell 
lines were collected as part of The ENCODE Project and can be 
found under "Histone Modifications by ChlP-seq from ENCODE/ 
University of Washington" on the UCSC Genome Browser (http:// 
genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=203697013&c=chr5&g= 
wgEncodeUwHistone). The CAGE tag data from whole cell polyA+ 
fractions of the HeLa, hESC HI, K562, HepG2, and GM12878 cell 
lines were collected as part of The ENCODE Project and can be 
found under (http://genome.ucsc.edu/ cgi-bin/hgTrackUi?hgsid= 
210114571&c=chr21&g=wgEncodeRikenCage). The EP300 ChIP 
binding site information was generated by our group as part of The 
ENCODE Project and can be found under "ENCODE Transcription 
Factor Binding Sites by ChlP-seq from HudsonAlpha Institute" 
on the UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/ 
hgTrackUi?hgsid=2 10114571 &c=chr2 1 &g=wgEncodeHaibTfbs) . 
Binding sites for EP300 identified in GM12878, HI hESC and 
HepG2 were combined to create a list of potential enhancers. Ge- 
netic polymorphism that created CpG dinucleotides were identified 
as those positions where the bowtie alignment identified the same 
mismatched base in at least 10% of the reads with a minimum read 



depth of five. Functional annotation and enrichment of genes was 
obtained using the gene ontology search program Gorilla (http:// 
cbl-gorilla.cs.technion.ac.il/) (Eden et al. 2009), using the option 
that calculates enrichment in a target gene list over a background list 
of all genes covered in the RRBS libraries. The motif logos repre- 
senting the sequence context of non-CpG cytosine methylation 
were created using WebLogo (http://weblogo.berkeley.edu) (Crooks 
etal. 2004). 

Data access 

The DNA methylation data generated as part of The ENCODE 
Project are available for visualization and download from the 
UCSC Genome Browser (GRCh37/hgl9) (http://www.genome. 
ucsc.edu) under the Regulation heading in the ENCODE DNA 
Methylation tracks. All of the RNA-seq and ChlP-seq data used 
in the analysis are also available for visualization and download 
from the UCSC Genome Browser (links are listed in Supplemental 
Table SI). The DNA methylation data are also available from the 
NCBI Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih. 
gov/geo/) through accession numbers GSE27584 and GSE42590. 
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